Compatibility Issue with Chinese Text in Document Parsing #3530

Draft: wants to merge 35 commits into base: main

Coniferish
Collaborator

Duplicate of #3267 since forked PRs are failing to pass chipper CI tests

JIAQIA and others added 30 commits May 23, 2024 19:46
…apply to text type classification

- Added a `languages` attribute to the Document base class. Language-dependent logic comes up in many of the document's methods, so a shared language list with a default value is needed. The attribute also goes some way toward meeting the requirements of domain-driven design, since language is a property of the document itself.
- Added `languages` option to `DocxPartitionerOptions` to specify a list of languages to use for text type classification.
- Modified `_DocxPartitioner.detect_text_type()` to use the specified languages or automatically detect the languages if "auto" is specified.
- This allows the partitioner to more accurately classify text elements based on the language, improving the overall partitioning quality.
- For HTML and MD (MD uses the HTML partition method), the `languages` field is now passed through the entire construction chain until it reaches `is_possible_narrative_text` and `is_possible_title`. Both functions already supported language-specific checks, but the `languages` parameter never reached them, so that capability was effectively disabled. This update enables it.
- **BREAKING CHANGE**: The `DocxPartitionerOptions` constructor and some other partition functions now take a new `languages` parameter. Because most parameters have default values, most existing callers should be unaffected, so treat this as a warning rather than a hard break. The docx and md test cases were rerun and pass, and simple test cases for the new feature were added to confirm it works.
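
As a rough illustration of the plumbing described in the list above, the sketch below threads a `languages` list from an options object down to a classifier and resolves `"auto"` by detection. Every name here (`PartitionerOptions`, `detect_languages`, `resolve_languages`, `classify_text`, `partition_paragraphs`) and the CJK-range heuristic are hypothetical stand-ins, not the PR's code; the real partitioner relies on the library's own language detection and text-type functions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PartitionerOptions:
    """Hypothetical stand-in for DocxPartitionerOptions."""

    # A default value keeps existing callers working without changes.
    languages: List[str] = field(default_factory=lambda: ["auto"])


def detect_languages(text: str) -> List[str]:
    """Toy detector (assumption): any CJK ideograph means Chinese."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return ["zho"]
    return ["eng"]


def resolve_languages(opts: PartitionerOptions, text: str) -> List[str]:
    """Use the caller's languages unless "auto" asks for detection."""
    if opts.languages == ["auto"]:
        return detect_languages(text)
    return opts.languages


def classify_text(text: str, languages: List[str]) -> str:
    """Toy classifier; the point is that `languages` reaches this layer."""
    terminators = "。!?" if "zho" in languages else ".!?"
    if text.rstrip().endswith(tuple(terminators)):
        return "NarrativeText"
    return "Title"


def partition_paragraphs(
    paragraphs: List[str], opts: Optional[PartitionerOptions] = None
) -> List[Tuple[str, str]]:
    """Classify each paragraph, passing the resolved languages down."""
    opts = opts or PartitionerOptions()
    return [(classify_text(p, resolve_languages(opts, p)), p) for p in paragraphs]
```

With the default `["auto"]`, a paragraph ending in `。` is classified with Chinese sentence terminators; passing `PartitionerOptions(languages=["eng"])` overrides detection for every paragraph.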

---

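### feat(unstructured/partition/docx.py): add language detection and apply it to text type classification

- Added a `languages` attribute to the Document base class. A document should have an attribute like this to express its current language, because language issues come up in many of the document's methods. In those cases a shared language array as the default value is necessary, and the attribute also goes some way toward meeting the requirements of domain-driven design.
- Added a `languages` option to `DocxPartitionerOptions` to specify the list of languages used for text type classification.
- Modified `_DocxPartitioner.detect_text_type()` to use the specified languages, or to detect the languages automatically when "auto" is specified.
- This lets the partitioner classify text elements more accurately based on language, improving overall partitioning quality.
- For HTML and MD (MD uses the HTML partition method), the `languages` field is passed through the entire construction chain until it is finally used in the `is_possible_narrative_text` and `is_possible_title` functions. Previously, although these two functions supported different judgments for different languages, the `languages` parameter was not passed correctly, so this capability was never enabled. This update enables it.
- **BREAKING CHANGE**: The `DocxPartitionerOptions` constructor and some other partition functions now require a new `languages` parameter. This is a breaking change for existing code. However, since most parameters have default values, it is not entirely a breaking update; this is only a warning. The docx and md test cases have been rerun and pass, and simple test cases for the new feature have been submitted to confirm it works.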
Ran the full test suite; the pass rate is essentially the same as on the main branch.
… "DocxPartitionerOptions" to collapse into keyword arguments (kwargs).

2. Change "capitalizable_languages" to "non_capitalizable_languages" in the function "is_possible_narrative_text".
…ure/zh_adaptation

# Conflicts:
#	CHANGELOG.md
This commit fixes an issue where the function `is_possible_narrative_text` incorrectly returned `True` for an empty list of languages. It should return `False` in that case.
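
The two behaviors mentioned in the commits above (skipping the capitalization heuristic for languages without letter case, and returning `False` for an empty language list) can be sketched roughly as follows. This is a simplified illustration, not the library's implementation: the contents of `NON_CAPITALIZABLE_LANGUAGES`, the `exceeds_cap_ratio` rule, and the function name `is_possible_narrative_text_sketch` are all assumptions made for the example.

```python
from typing import List

# Assumption: languages whose scripts have no upper/lower case distinction.
NON_CAPITALIZABLE_LANGUAGES = {"zho", "jpn", "kor"}


def exceeds_cap_ratio(text: str, threshold: float = 0.5) -> bool:
    """True when more than `threshold` of the words start with an uppercase letter."""
    words = text.split()
    if not words:
        return False
    capitalized = sum(1 for word in words if word[0].isupper())
    return capitalized / len(words) > threshold


def is_possible_narrative_text_sketch(text: str, languages: List[str]) -> bool:
    """Simplified narrative-text check that honors the language list."""
    # No languages (or no text) means there is nothing to base a positive answer on.
    if not languages or not text.strip():
        return False
    # Apply the capitalization heuristic only when at least one language has letter case.
    if not set(languages) <= NON_CAPITALIZABLE_LANGUAGES:
        if exceeds_cap_ratio(text):
            return False
    return True
```

Here a heading-like string such as `"ANNUAL REPORT SUMMARY"` is rejected for `["eng"]`, while Chinese text is never rejected on capitalization grounds when `languages` is `["zho"]`.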
# Conflicts:
#	test_unstructured/documents/test_html.py
#	unstructured/documents/base.py
#	unstructured/documents/html.py
#	unstructured/documents/xml.py
#	unstructured/partition/epub.py
#	unstructured/partition/html.py
This commit fixes an issue where the function `is_possible_narrative_text` incorrectly returned `True` for an empty list of languages. It should return `False` in that case.
…e code formatting

- Update CHANGELOG.md to include compatibility issue fix for Chinese text in document parsing.
- Reformat import statements in test_odt.py for better readability.
- Adjust import order in html.py to adhere to PEP8 guidelines.
- Add `languages` parameter to text processing functions in pdf.py and text.py for improved language handling.
- Reformat long lines to improve code readability and maintain consistency.

Co-authored-by: Your Name <[email protected]>
# Conflicts:
#	unstructured/documents/html.py
#	unstructured/partition/html.py
…cOS, gsed (installed via brew) replaces the default sed. The script includes platform checks to use gsed on MacOS and sed on Linux. Additionally, awk is used for version extraction. Preliminary tests indicate the script works correctly on both Linux and MacOS.
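
The platform check described in this commit message could look roughly like the following. The commit refers to a shell script that uses `sed`/`gsed` and `awk`; this sketch restates the same idea in Python, swaps a regular expression in for the `awk` step, and assumes the version file contains a line of the form `__version__ = "x.y.z"`. The function names `pick_sed` and `extract_version` are hypothetical.

```python
import platform
import re
from typing import Optional


def pick_sed(system: Optional[str] = None) -> str:
    """Return 'gsed' on macOS (GNU sed from brew) and 'sed' elsewhere."""
    system = system or platform.system()
    return "gsed" if system == "Darwin" else "sed"


def extract_version(version_file_text: str) -> Optional[str]:
    """Pull the version string out of a `__version__ = "..."` line, if present."""
    match = re.search(r'__version__\s*=\s*["\']([^"\']+)["\']', version_file_text)
    return match.group(1) if match else None
```

`platform.system()` reports `"Darwin"` on macOS and `"Linux"` on Linux, which is what makes the single check sufficient for the two platforms the commit mentions.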
Added logic in the `test_weaviate_schema_is_valid` test function to check the existing Weaviate schema. If the class to be created already exists, the creation step is skipped and a corresponding message is printed to avoid creating a duplicate class.
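
A minimal sketch of the "skip creation if the class already exists" check is shown below. It assumes a weaviate-client v3 style interface, where `client.schema.get()` returns a dict with a `"classes"` list and `client.schema.create_class()` creates one class. The helper name `ensure_class` and the in-memory client are hypothetical and exist only so the example runs without a Weaviate server.

```python
from typing import Any, Dict, List


def ensure_class(client: Any, class_obj: Dict[str, Any]) -> bool:
    """Create `class_obj` unless a class with the same name already exists.

    Returns True when the class was created and False when it was skipped.
    """
    existing = {c["class"] for c in client.schema.get().get("classes", [])}
    if class_obj["class"] in existing:
        print(f"Class {class_obj['class']!r} already exists; skipping creation.")
        return False
    client.schema.create_class(class_obj)
    return True


class _InMemorySchema:
    """Hypothetical stand-in for the client's schema API, for illustration only."""

    def __init__(self) -> None:
        self._classes: List[Dict[str, Any]] = []

    def get(self) -> Dict[str, Any]:
        return {"classes": list(self._classes)}

    def create_class(self, class_obj: Dict[str, Any]) -> None:
        self._classes.append(class_obj)


class InMemoryClient:
    """Minimal fake client exposing only `.schema`."""

    def __init__(self) -> None:
        self.schema = _InMemorySchema()
```

Calling `ensure_class` twice with the same class definition creates the class once and prints the skip message the second time, which keeps a rerun of the test from failing on a duplicate class.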
# Conflicts:
#	CHANGELOG.md
#	examples/pgvector/pgvector.ipynb
#	examples/training/0-Core Concepts.ipynb
#	examples/training/1-Intro to Bricks.ipynb
#	examples/training/2-File Exploration.ipynb
#	examples/weaviate/weaviate.ipynb
#	test_unstructured/partition/test_auto.py
#	unstructured/documents/html.py
Fixed some formatting issues in the Chinese test documents.
Add change log
…zh_adaptation

# Conflicts:
#	CHANGELOG.md
#	unstructured/__version__.py